Abstract

The classification of life should be based upon the fundamental mechanism in the evolution
of life. We found that the global relationships among species are better described by circular
phylogenies, in contrast to the conventional picture based upon phylogenetic trees. The
genealogical circles can be observed clearly in an analysis of the protein length distributions
of contemporary species. Thus, we suggest that domains can be defined by distinct phylogenetic
circles, which are global and stable characteristics of living systems. The mechanism of
genome size evolution is clarified; hence the main component questions of the C-value enigma
can be explained. According to the correlations and quasi-periodicity of protein length
distributions, we can also classify life into three domains.
1 Background and motivation

In the absence of any ancient genetic sequences, scientists in the field of molecular evolution 
have to figure out reasonable mechanisms to retrieve the evolutionary history according to the 
genetic information of contemporary species. Traditionally, the basis for a natural taxonomy 
was provided by complex morphologies and a detailed fossil record. With the sequencing revo- 
lution, we had a new opportunity to understand the richer and more credible information on the 
evolution of life stored in the molecular sequences. Consequently, the basis for the definition of 
taxa has progressively shifted from the organismal to the cellular to the molecular level. Based 
upon rRNA sequence comparisons, life on this planet can be divided into three domains: the
Bacteria, the Archaea, and the Eucarya [1]. The differences that separate the three domains are
of a more profound nature than the differences that separate the classical five kingdoms
(Monera, Protista, Fungi, Plantae, Animalia).

The evolution of protein length is poorly understood at present. Protein lengths vary
notably both within a proteome and among species, and the average protein lengths of eukaryotes
are in general longer than those of prokaryotes [2, 3]. But there are factors that increase and
factors that decrease protein length, and it is still unclear whether proteins tend in general
to increase in length [4, 5, 6]. Abundant evidence indicates that there is an underlying order
in protein sequence organization. It is generally supposed that there are various structural
and functional units in protein sequences. Periodicity was observed in protein length
distributions [7, 8]. There is evidence for short-range correlation of protein lengths according
to investigations by detrended fluctuation analysis [9, 10]. The correspondence between biology
and linguistics at the level of sequence and lexical inventories, and of structure and syntax,
has fuelled attempts to describe genome structure by the rules of formal linguistics [11, 12]. So
Zipf's law, originally found in linguistics, can be used to study the rank-size distribution of
protein lengths [20, 12].

We found that protein length distributions can be taken as concise and comprehensive
records of the evolution of life. Protein length should not be treated as a random quantity:
the order in protein lengths is recorded in the protein length distributions. We found a
profound relationship between protein length evolution and genome size evolution, so we
may unravel the mechanism of genome size evolution from the properties of protein length
distributions. We found that the global taxonomy of life can be illustrated as phylogenetic
circles rather than phylogenetic trees. Considering that phylogenetic circles are stable
characteristics of living systems, we suggest that the circular phylogenetic relationship can
be taken as a new criterion to identify domains.

The motivation of this work is to study the mechanisms of genome evolution based on the
properties of protein length distributions; consequently we can classify life in a global
scenario of phylogenetic circles. We can explain (i) the trend of genome size evolution at the
levels of domains and phyla, (ii) the patterns of genome size distributions, and (iii) the
bidirectional driving force in genome size evolution. Finally, we classify life into three
domains based on the properties of protein length distributions.

When trying to infer the early history of life from present biological data, we can borrow
some ideas from physics. There is an analogy between the study of stellar evolution based on
present observational data of stellar spectra and the current task of inferring the evolutionary
history of life from protein length distributions. Although only contemporary data can be
observed in both cases, we can take the current states of stars, or of species, as various
stages of their evolution. In astronomy, the Hertzsprung-Russell diagram shows a group of stars
in various stages of their evolution according to the relation of absolute magnitude to stellar
color, which is helpful for understanding stellar evolution [13, 14]. In the study of molecular
evolution, similar diagrams can also be plotted to show a group of species in various stages of
their evolution based on protein length distributions.

2 Correlation analysis and spectral analysis of protein length 
distributions 

Data collection. The data processing in this paper is based on biological data. In most
calculations, the protein length distributions are obtained from the data of $n = 106$ complete
proteomes (85 bacteria, 12 archaea, 7 eukaryotes and 2 viruses) in the database Predictions
for Entire Proteomes (PEP) [15]. Only when we study the detailed properties of genealogical
circles and the bifurcation of the genome size distribution are the protein length distributions
obtained from both the $n = 106$ species in PEP and 775 species in the National Center for
Biotechnology Information (NCBI) database.

We denote by $s(\alpha)$ the genome size of species $\alpha$ and by $\eta(\alpha)$ the ratio
of non-coding DNA to the total DNA of species $\alpha$. The data for $\eta(\alpha)$ and
$s(\alpha)$ are obtained from Ref. [16], in which 54 species (6 eukaryotes, 5 archaebacteria and
43 eubacteria) can also be found in PEP. The gene numbers $N$ are obtained from the numbers of
Open Reading Frames (ORFs) in the proteomes in PEP. There are $s(\alpha)\eta(\alpha)$ base pairs
(bp) of non-coding DNA and $s(\alpha)(1 - \eta(\alpha))$ bp of coding DNA in the genome of
species $\alpha$.

Protein length distribution. The protein length distribution of a species can be obtained
unambiguously if the lengths of the proteins in its proteome are known. In calculating a protein
length distribution, we consider only the protein-coding genes, and a gene with more than one
copy is counted only once. Transposable elements contribute little to protein length
distributions. For instance, only dozens of genes in the human genome appear to have been
derived from transposable elements [17].

The protein length distribution of a species $\alpha$ can be denoted by a vector

$$\mathbf{x}(\alpha) = (x_1(\alpha), x_2(\alpha), \ldots, x_k(\alpha), \ldots), \tag{1}$$

where there are $x_k(\alpha)$ proteins whose lengths are exactly $k$ amino acids (a.a.) in the
entire proteome of this species (Fig. 1a). The average protein length in the proteome of species
$\alpha$ can be calculated from the protein length distribution:

$$\bar{l}(\alpha) = \frac{\sum_k k\, x_k(\alpha)}{\sum_k x_k(\alpha)}. \tag{2}$$

And the standard deviation of protein lengths can be calculated by:

$$\Delta l(\alpha) = \sqrt{\frac{\sum_k (k - \bar{l}(\alpha))^2\, x_k(\alpha)}{\sum_k x_k(\alpha)}}. \tag{3}$$

The total protein length distribution of all the species in PEP is denoted by

$$\mathbf{X} = \sum_{\alpha \in \mathrm{PEP}} \mathbf{x}(\alpha). \tag{4}$$

Since there are few very long proteins, it is practical to choose a sufficiently large protein
length as the cutoff of the protein length distributions in the data processing. Here, we set
the cutoff as $m = 3000$ amino acids (a.a.). Almost all the neglected elements of the protein
length distributions, i.e., $x_i(\alpha)$ with $i > m$, vanish according to the biological data
in PEP and so contribute little to our data analysis. Our conclusions therefore do not depend on
the choice of $m$.

A peak in the fluctuations of the protein length distribution $\mathbf{x}(\alpha)$ is identified
at length $i$ when $x_i(\alpha)$ is greater than both $x_{i-1}(\alpha)$ and $x_{i+1}(\alpha)$.
The number of peaks of the protein length distribution $\mathbf{x}(\alpha)$ is denoted by
$p(\alpha)$. No smoothing is applied to the protein length distributions when counting the
number of peaks, so $p(\alpha)$ can be obtained rigorously for any species whose proteome is
known. The peak number $p$ carries profound biological meaning.
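
The quantities defined so far (the length distribution, its mean, its standard deviation and the peak number) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the standard deviation is taken in its population form, and the two boundary lengths (1 and the cutoff) are assumed never to count as peaks.

```python
import math
from collections import Counter

def length_distribution(lengths, m=3000):
    """Return x where x[k] is the number of proteins of length k (1 <= k <= m).

    Index 0 is unused padding so that list indices match protein lengths.
    Proteins longer than the cutoff m are dropped.
    """
    counts = Counter(l for l in lengths if 1 <= l <= m)
    return [0] + [counts.get(k, 0) for k in range(1, m + 1)]

def mean_length(x):
    """Average protein length weighted by the counts in x."""
    return sum(k * xk for k, xk in enumerate(x)) / sum(x)

def std_length(x):
    """Population standard deviation of protein length (an assumed convention)."""
    mean = mean_length(x)
    return math.sqrt(sum((k - mean) ** 2 * xk for k, xk in enumerate(x)) / sum(x))

def count_peaks(x):
    """Count interior lengths i with x[i] > x[i-1] and x[i] > x[i+1]; no smoothing."""
    return sum(1 for i in range(2, len(x) - 1) if x[i] > x[i - 1] and x[i] > x[i + 1])

# Toy proteome of seven proteins with a cutoff of 6 a.a.
x = length_distribution([3, 3, 3, 4, 5, 5, 2], m=6)
p = count_peaks(x)
```

Because no smoothing is applied, a single protein can create or destroy a peak, so the count is sensitive to the exact set of annotated genes.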

Correlation analysis. Given any pair of species $\alpha$ and $\beta$ in PEP, there are several
ways to evaluate the correlation between their protein length distributions. Accordingly, we can
calculate the corresponding average correlation between any species and all the species in PEP.

The correlation polar angle $\theta(\alpha)$ of species $\alpha$ is defined as the angle between
the vectors $\mathbf{x}(\alpha)$ and $\mathbf{X}$, i.e.

$$\theta(\alpha) = \frac{2}{\pi} \arccos\left( \frac{\mathbf{x}(\alpha) \cdot \mathbf{X}}{\|\mathbf{x}(\alpha)\|\, \|\mathbf{X}\|} \right), \tag{5}$$

where the factor $2/\pi$ is added so that the value of $\theta$ ranges from 0 to 1. The smaller
the value of $\theta(\alpha)$ is, the closer the average correlation of the protein length
distribution of species $\alpha$ is.

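The correlation coefficient of protein length distributions between species $\alpha$ and $\beta$
is defined by

$$R(\alpha, \beta) = \frac{\sum_{k=1}^{m} (x_k(\alpha) - \bar{x}(\alpha))(x_k(\beta) - \bar{x}(\beta))}{\sqrt{\sum_{k=1}^{m} (x_k(\alpha) - \bar{x}(\alpha))^2}\, \sqrt{\sum_{k=1}^{m} (x_k(\beta) - \bar{x}(\beta))^2}}, \tag{6}$$

where $\bar{x}(\alpha) = \frac{1}{m} \sum_{k=1}^{m} x_k(\alpha)$. And the average correlation
coefficient of species $\alpha$ can be defined by

$$R(\alpha) = \frac{1}{n} \sum_{\beta \in \mathrm{PEP}} R(\alpha, \beta). \tag{7}$$

The value of $R(\alpha)$ ranges from 0 to 1. The larger the value of $R$ is, the closer the
average correlation of protein length distributions is.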

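The Minkowski distance between species $\alpha$ and $\beta$ is defined by

$$D(\alpha, \beta) = \left( \sum_{k=1}^{m} \left| \frac{x_k(\alpha)}{\|\mathbf{x}(\alpha)\|} - \frac{x_k(\beta)}{\|\mathbf{x}(\beta)\|} \right|^{q} \right)^{1/q}, \tag{8}$$

where $q$ is a parameter. And the average Minkowski distance of species $\alpha$ can be defined by

$$D(\alpha) = \frac{1}{n} \sum_{\beta \in \mathrm{PEP}} D(\alpha, \beta). \tag{9}$$

The smaller the value of $D$ is, the closer the average correlation of protein length
distributions is.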
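
The three correlation measures above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the correlation coefficient is taken to be the usual Pearson form, the Minkowski distance is assumed to act on norm-scaled vectors, and the average is taken over every species in the collection, including the species itself.

```python
import math

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

def polar_angle(x, total):
    """Angle between x and the pooled distribution, scaled by 2/pi to lie in [0, 1]."""
    cos = sum(a * b for a, b in zip(x, total)) / (_norm(x) * _norm(total))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding slightly outside [-1, 1]
    return 2.0 / math.pi * math.acos(cos)

def pearson(x, y):
    """Pearson correlation coefficient between two distributions of equal length."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def minkowski(x, y, q=2.0):
    """Minkowski distance of order q between the norm-scaled vectors."""
    nx, ny = _norm(x), _norm(y)
    return sum(abs(a / nx - b / ny) ** q for a, b in zip(x, y)) ** (1.0 / q)

def average_over(alpha, species, measure):
    """Average a pairwise measure between species[alpha] and every entry of the dict."""
    return sum(measure(species[alpha], v) for v in species.values()) / len(species)

# Two toy "species" whose distributions do not overlap at all.
species = {"a": [1, 0, 0], "b": [0, 1, 0]}
pooled = [sum(col) for col in zip(*species.values())]
theta_a = polar_angle(species["a"], pooled)
```

Since all components of a length distribution are non-negative, the cosine never drops below zero, which is why the scaled angle stays between 0 and 1.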

Spectral analysis. We can study the order in the fluctuations of the protein length
distributions by the method of spectral analysis. The discrete Fourier transform of the protein
length distribution $\mathbf{x}(\alpha)$ is:

$$\hat{x}_f(\alpha) = \sum_{k=1}^{m} x_k(\alpha)\, e^{-2\pi i (k-1)(f-1)/m}. \tag{10}$$

The power spectrum, i.e., the modulus of the discrete Fourier transform, is defined as (Fig. 1b)

$$y_f(\alpha) = |\hat{x}_f(\alpha)| = \sqrt{(\mathrm{Re}\, \hat{x}_f(\alpha))^2 + (\mathrm{Im}\, \hat{x}_f(\alpha))^2}. \tag{11}$$

The power spectrum $\mathbf{y} = (y_1, \ldots, y_m)$ is mirror symmetric between
$(y_1, \ldots, y_{m/2})$ and $(y_{m/2+1}, \ldots, y_m)$ according to the properties of the
discrete Fourier transform.

The peaks in the power spectrum $\mathbf{y}(\alpha)$ relate to the periodicity of fluctuations
in the protein length distribution $\mathbf{x}(\alpha)$. In the following, we consider only the
left half of the power spectrum $(y_1, \ldots, y_{m/2})$; the properties of the right half are
the same by mirror symmetry. Besides, we neglect the power spectrum at very low frequency, where
the peaks are always high due to the general bell-shaped profiles of protein length
distributions. The high frequency sector refers to the power spectrum at $f$ near $m/2$, and the
low frequency sector refers to the power spectrum at $f$ much less than $m/2$. The characteristic
frequency of the highest peak in the left half of the power spectrum
$(y_1(\alpha), \ldots, y_{m/2}(\alpha))$ is denoted by $f_c(\alpha)$ (Fig. 1b). Moreover, we can
find the top $n_t$ highest peaks in the fluctuations of the left half of the power spectrum. The
maximum of the frequencies of these top $n_t$ highest peaks is denoted by $f_m(\alpha)$, which is
intended to pick out an obvious peak at large frequency (Fig. 7a - 7c). We define the
characteristic period $L_c$ and the minimum period $L_m$ of the fluctuations of the protein
length distribution as follows:

$$L_c(\alpha) = m / f_c \tag{12}$$
$$L_m(\alpha) = m / f_m, \tag{13}$$

which are free of the choice of the cutoff m. We chose rit — 30 for and L'^ and rit — 80 for 
and in the calculation. 
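
As a rough illustration of the spectral step, the sketch below computes the modulus of a discrete Fourier transform directly and reads off the strongest frequency in the left half of the spectrum. It makes several assumptions that are not in the text: frequencies are indexed from 0 (so index j means j cycles over the whole length range), the low-frequency threshold `min_freq` is a hypothetical choice, and the function names are invented for the example. The direct sum costs on the order of m squared operations, which is slow but tolerable for a cutoff of a few thousand.

```python
import cmath
import math

def power_spectrum(x):
    """Modulus of the DFT of x; entry j is the component with j cycles over len(x)."""
    m = len(x)
    return [abs(sum(xk * cmath.exp(-2j * cmath.pi * k * j / m) for k, xk in enumerate(x)))
            for j in range(m)]

def characteristic_period(x, min_freq=2):
    """Return (f_c, L_c): the strongest frequency in the left half and m / f_c.

    Frequencies below min_freq are skipped to ignore the low-frequency power
    produced by the overall profile of the distribution.
    """
    m = len(x)
    y = power_spectrum(x)
    f_c = max(range(min_freq, m // 2), key=lambda j: y[j])
    return f_c, m / f_c

# Synthetic check: a flat distribution with a ripple of period 12 a.a.
x = [50 + 10 * math.cos(2 * math.pi * k / 12) for k in range(120)]
f_c, L_c = characteristic_period(x)  # f_c == 10, L_c == 12.0
```

The recovered period equals the ripple period regardless of the length of the synthetic series, as long as that length is a multiple of the period.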
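The average power spectra for the three domains (Bacteria, Archaea and Eucarya) are,
respectively:

$$\bar{\mathbf{y}}_{\mathrm{Bacteria}} = \frac{1}{85} \sum_{\alpha \in \mathrm{Bacteria}} \mathbf{y}(\alpha) \tag{14}$$
$$\bar{\mathbf{y}}_{\mathrm{Archaea}} = \frac{1}{12} \sum_{\alpha \in \mathrm{Archaea}} \mathbf{y}(\alpha) \tag{15}$$
$$\bar{\mathbf{y}}_{\mathrm{Eucarya}} = \frac{1}{7} \sum_{\alpha \in \mathrm{Eucarya}} \mathbf{y}(\alpha) \tag{16}$$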

3 Calculation of genome size and non-coding DNA content 

Calculation of genome size. Genome size evolution is one of the central problems in the study
of molecular evolution because it is a macroevolutionary question and helps us to understand the
large-scale patterns in the history of life [18, 19]. In a previous work [20], we found a close
relationship between the genome size $s$, the correlation polar angle $\theta$ of protein length
distributions and the non-coding DNA content $\eta$; hence the genome size of contemporary species
can be calculated by an empirical formula with two variables. In this paper, we also found a
close relationship between the genome size $s$ and the peak number $p$, from which we obtained
another, single-variable empirical formula to calculate genome sizes. From the relationship
between the two formulae, we can obtain an empirical formula to calculate the non-coding DNA
content $\eta$ based only on the data of coding DNA. This interesting result implies that the
non-coding DNA content depends on the coding DNA.

In the previous work, we found that the genome size $s$ relates to two variables: the non-coding
DNA content $\eta$ and the correlation polar angle $\theta$. Hence we obtained a double-variable
empirical formula to calculate the genome size of a given species [20]:

$$s(\eta, \theta) = s_0 \exp\left( \frac{\eta}{a} - \frac{\theta}{b} \right), \tag{17}$$

where sq = 7.96 x 10^ bp, a = 0.165 and b = 0.176. The crux to obtain this formula is 
to find the proportional relationship between genome size $s$ and correlation polar angle
$\theta$ for prokaryotes. The biological meaning of $\theta(\alpha)$ is the average correlation
of protein length distributions between species $\alpha$ and all the other species. Furthermore,
we naturally introduced the second variable $\eta$ in the formula so that it can be generalized
to eukaryotes. For the prokaryotes, the non-coding DNA contents are about 10 percent, but the
correlation polar angles range from about 0.6 to 0.1. For the eukaryotes, the correlation polar
angles are around 0.1, but the non-coding DNA contents range from 0.1 to near 1. This
double-variable formula predicts genome sizes well not only for prokaryotes but also for
eukaryotes. We also proposed a formula to describe the trend of genome size evolution [20]:

$$s(t) = s_0 \exp\left( \frac{\eta(t)}{a} - \frac{\theta(t)}{b} \right). \tag{18}$$

Thus the dynamic parameters $\eta(t)$ and $\theta(t)$ act as the promoting factor and the
hindering factor, respectively, in determining the trend of genome size evolution.
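
To make the opposite roles of the two variables concrete, the two-variable formula can be evaluated as in the sketch below. The constants a and b are the values quoted in the text; the prefactor is deliberately left as an argument, and the helper `relative_size` is a hypothetical convenience that reports sizes in units of that prefactor.

```python
import math

A = 0.165  # scale of the non-coding DNA content eta, as quoted in the text
B = 0.176  # scale of the correlation polar angle theta, as quoted in the text

def genome_size(eta, theta, s0):
    """Two-variable formula s = s0 * exp(eta / a - theta / b); s0 is supplied in bp."""
    return s0 * math.exp(eta / A - theta / B)

def relative_size(eta, theta):
    """Genome size in units of s0, convenient when only ratios matter."""
    return genome_size(eta, theta, s0=1.0)

# Prokaryote-like range: eta fixed near 0.1, theta moving from 0.6 down to 0.1.
spread = relative_size(0.1, 0.1) / relative_size(0.1, 0.6)
```

With the non-coding content held at 0.1, moving the polar angle from 0.6 to 0.1 multiplies the predicted size by a factor of about 17, while raising the non-coding content at a fixed angle always increases it.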

In this paper, we found another, single-variable empirical formula to calculate the genome
size. We found that there is an exponential relationship between the genome size $s$ and the
number of peaks $p$:

$$s(p) = s' \exp\left( \frac{p}{p_0} \right), \tag{19}$$

where s' = 8.36 x lO'' bp and po = 70.6 are determined by least squares. There is only 
one variable $p$ in this formula. The genome sizes predicted by this formula agree very well
with the biological data (Fig. 2a). The single-variable formula is valid not only for
prokaryotes but also for eukaryotes. Therefore, the genome sizes of both prokaryotes and
eukaryotes can be investigated in a unified framework.
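
The text states that the two constants of the single-variable formula are determined by least squares but does not say in which space the fit is done. One common choice, assumed here, is ordinary least squares on the logarithm of the genome size, which turns the exponential law into a straight line. The function name and the synthetic data below are hypothetical.

```python
import math

def fit_exponential(peaks, sizes):
    """Fit ln(s) = ln(s') + p / p0 by ordinary least squares; return (s_prime, p0)."""
    n = len(peaks)
    logs = [math.log(s) for s in sizes]
    mean_p = sum(peaks) / n
    mean_l = sum(logs) / n
    slope = (sum((p - mean_p) * (l - mean_l) for p, l in zip(peaks, logs))
             / sum((p - mean_p) ** 2 for p in peaks))
    intercept = mean_l - slope * mean_p
    return math.exp(intercept), 1.0 / slope

# Synthetic data generated from known constants, to check that the fit recovers them.
true_s_prime, true_p0 = 2.0, 50.0
peaks = [100, 200, 300, 400]
sizes = [true_s_prime * math.exp(p / true_p0) for p in peaks]
s_prime_fit, p0_fit = fit_exponential(peaks, sizes)
```

Fitting in log space weights relative errors equally across small and large genomes; a fit on the raw sizes would instead be dominated by the largest genomes.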

Calculation of non-coding DNA content. From the relationship between the above two empirical
formulae, we obtained an empirical formula to calculate the non-coding DNA content:

$$\eta(\alpha) = 0.938\, \theta(\alpha) + 0.00234\, p(\alpha) - 0.752, \tag{20}$$

where both $\theta$ and $p$ are defined based only upon the protein length distributions. The
predicted non-coding DNA contents agree with the experimental observations (Fig. 2b). According
to the formulae for $s$ and $\eta$, we can calculate the size of coding DNA $(1 - \eta)s$ as well
as the size of non-coding DNA $\eta s$ from the values of $p$ and $\theta$ for any species. At
first thought, such a result is quite surprising. We can obtain the size of coding DNA directly
from the protein length distribution: $(1 - \eta)s = \sum_i i\, x_i$. But the protein lengths in
the coding DNA segment would not be expected to give direct evidence about the size of
non-coding DNA.
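
The linear formula for the non-coding DNA content follows from equating the two genome size formulae and solving for the non-coding content. The short check below, a sketch with hypothetical variable names, recomputes the two slopes from the quoted constants a, b and p0, and shows what ratio of the two prefactors the printed intercept implies.

```python
import math

a, b, p0 = 0.165, 0.176, 70.6   # constants quoted for the two genome size formulae

# Equating s0*exp(eta/a - theta/b) with s'*exp(p/p0) and solving for eta gives
#   eta = (a/b)*theta + (a/p0)*p + a*ln(s'/s0)
coef_theta = a / b
coef_p = a / p0

# The printed intercept -0.752 then fixes the ratio s'/s0 implied by the two fits.
implied_ratio = math.exp(-0.752 / a)

def noncoding_content(theta, p):
    """Linear formula for eta using the coefficients printed in the text."""
    return 0.938 * theta + 0.00234 * p - 0.752

eta_at_critical = noncoding_content(0.1, 324)
```

Both recomputed slopes agree with the printed coefficients to the quoted precision, and with a polar angle of 0.1 and 324 peaks the formula returns a non-coding content very close to 0.1.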

This result is profound because it shows that there is a close relationship between the sizes
of the non-coding DNA segment $\eta s$ and the coding DNA segment $(1 - \eta)s$. The evolution of
non-coding DNA relates to the evolution of coding DNA, whose functions may relate closely to
cellular differentiation. The size of non-coding DNA cannot be arbitrary if the protein length
distribution $\mathbf{x}$ is given. The information about the size of coding DNA is stored in the
components $x_i$ of the protein length distribution $\mathbf{x}$, and the order of the components
is irrelevant to that calculation. By contrast, the order of the components $x_i$, i.e., the
pattern of fluctuations of the protein length distribution $\mathbf{x}$, becomes significant for
calculating the size of non-coding DNA. So there is additional evolutionary information stored
in the fluctuations of protein length distributions.

The variable $p$ can be obtained directly from the protein length distribution of the species
itself, but the other variable $\theta$ depends on the protein length distributions of other
species. The crux in defining $\theta$ is to calculate the correlation between the fluctuations
of the protein length distributions $\mathbf{x}$ and $\mathbf{X}$. This indicates that there is a
universal mechanism of genome size evolution for all species. So the variable $\theta$, as an
average value of correlation, is essential in the calculation of non-coding DNA content.

Relationship between $p$ and $\eta$, $\theta$. Genome size evolution provides a clear example
of hierarchy in action. No one-dimensional explanation can account for the massive variation
in eukaryotic genome sizes [19]. The success of the double-variable formula for the genome size
results from the proper choice of the two variables $\theta$ and $\eta$. But why can we also
find a formula for the genome size with only one variable $p$? The correlation between genome
size $s$ and peak number $p$ cannot be explained trivially by the observation that genome size
and proteome size are correlated. The linear relationship between $p$ and $\log_{10} s$ shows
that $p$ is an intrinsic genomic property of a species, whereas the relationship between $p$ and
$\log_{10} N$ is non-linear (Fig. 2e). The fluctuations of protein length distributions can no
longer be taken as random fluctuations on a smooth background; they reflect the complexity of
the proteome and relate to the complexity of life.

We can understand the biological meaning of the peak number from the relationship between the
single variable $p$ and the pair of variables $\eta$ and $\theta$. Firstly, we studied the
relationship between $p$ and $\eta$ based on the biological data (see the distribution of species
in Fig. 2c). We found that there is a critical value $p_c$ of the peak number, which clearly
separates prokaryotes and eukaryotes in the $p - \eta$ plane. For prokaryotes, $p$ is less than
$p_c$ and $\eta$ is about constant. The distribution of species in the $\eta - p$ plane
($p < p_c$) forms a rightward triangle, which agrees with another triangular distribution of
prokaryotes in the $s - \eta$ plane (Fig. 3f) due to the correlation between peak number $p$ and
genome size $s$. The deviation of $\eta$ from its average value 0.1 becomes smaller and smaller
as $p$ approaches $p_c$. For eukaryotes, $p$ is greater than $p_c$ and $\eta$ increases with $p$.
The distribution of species in the $\eta - p$ plane ($p > p_c$) forms a leftward triangle. The
deviation of $\eta$ becomes bigger and bigger as $p$ moves away from $p_c$. So there are few
species in the area $p \sim p_c$. Such a regular distribution of species in the $p - \eta$ plane
indicates a profound relationship between $p$ and $\eta$, so $p$ cannot be biologically
meaningless. Secondly, we studied the relationship between $p$ and $\theta$ (Fig. 2d). We found
that $\theta$ declines with $p$ for prokaryotes when $p < p_c$, while $\theta$ is about constant
for eukaryotes when $p > p_c$. We point out that $p$
relates closely to the complexity of life. It was suggested that the non-coding DNA content
$\eta$ indicates the complexity of eukaryotes: the larger the value of $\eta$ (corresponding to
larger $p$) is, the greater the complexity of eukaryotes is [21]. In the case of prokaryotes, the
genome sizes, which are proportional to the gene numbers, can indicate the complexity of
prokaryotes because the non-coding DNA contents are about the same for prokaryotes. Thus, the
larger the genome size (corresponding to larger $p$) is, the greater the complexity of
prokaryotes is. A natural meaning of the peak number $p$ is an index for the complexity of the
structure of any protein length distribution. Summarizing the above, we suggest that the peak
number $p$ indicates the complexity of life.

In order to understand the peak number $p$ more clearly, we deduced its evolutionary formula
from the formula for the evolutionary trend of genome size in Ref. [20]:

$$s(t) = \begin{cases} s_1 \exp(t / \tau_1), & t < T_c \\ s_2 \exp(t / \tau_2), & t > T_c, \end{cases} \tag{21}$$

where $T_c = -560$ million years (Myr) ($t = 0$ for today), $\tau_1 = 644$ Myr, $\tau_2 = 106$ Myr,
$s_1$ = 1.98 x 10^ bp and $s_2$ = 1.65 x 10^ bp. We obtained that there were two stages in the
evolution of peak number:

$$p(t) = \begin{cases} p_0 \left( \ln \frac{s_1}{s'} + \frac{t}{\tau_1} \right), & t < T_c \\ p_0 \left( \ln \frac{s_2}{s'} + \frac{t}{\tau_2} \right), & t > T_c. \end{cases} \tag{22}$$

The critical peak number in the evolution is $p_c = p(T_c) = p_0 \ln \frac{s(T_c)}{s'} = 324$. We
found that the peak number evolves much faster in the period after $T_c$ than in the period
before $T_c$. Since the peak number did not evolve evenly, it cannot be regarded as an
independent variable in the evolution. The variable $p$ is underlain by the two variables $\eta$
and $\theta$. So the genome size always needs a two-dimensional explanation.
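
The two-stage evolution of the peak number can be sketched by combining the exponential relation between genome size and peak number with a piecewise exponential trend of genome size in time. The piecewise form used here is an assumption consistent with the quoted time scales; the prefactors s1, s2 and s' are left as arguments rather than hard-coded, and the function names are hypothetical.

```python
import math

P0 = 70.6                    # peak-number scale quoted for the single-variable formula
TAU_1, TAU_2 = 644.0, 106.0  # Myr, time scales before and after T_c
T_C = -560.0                 # Myr, with t = 0 today

def peak_number(t, s1, s2, s_prime):
    """p(t) = p0 * ln(s(t) / s') with s(t) = s_i * exp(t / tau_i) on each side of T_c."""
    s_i, tau = (s1, TAU_1) if t < T_C else (s2, TAU_2)
    return P0 * (math.log(s_i / s_prime) + t / tau)

def peak_growth_rate(t):
    """dp/dt in peaks per Myr; it depends only on p0 and the time scale."""
    return P0 / (TAU_1 if t < T_C else TAU_2)

# Ratio of the growth rates after and before the critical time.
speedup = peak_growth_rate(-100.0) / peak_growth_rate(-1000.0)
```

Under these assumptions the peak number grows about six times faster after the critical time than before it, independently of the prefactors.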
4 Phylogenetic circles in the $\Delta l - p$ plane and bifurcation of the genome
size distribution in the $\Delta l - s$ plane

Phylogenetic circles in the $\Delta l - p$ plane. Previously, we explained several main
problems of the C-value enigma, such as the genome size ranges in taxa and the genome size
distribution, according to the two-variable genome size formula [20, 18]. The biological meaning
of this formula can be understood more clearly when it is written in its differential form:

$$\frac{\Delta s}{s} = \frac{\Delta \eta}{a} - \frac{\Delta \theta}{b}. \tag{23}$$

Evidently, there are two factors in control of genome size evolution. The first variable $\eta$
is the promoter, whose contribution is measured by $a$, and the other variable $\theta$ is the
hinderer, whose contribution is measured by $b$. Genome size evolution is a bidirectional
course: the genome size may either increase or decrease in the evolution.

We found a striking distribution of species in the $\Delta l - p$ plane (Fig. 3a). The
eukaryotes, archaea and eubacteria are distributed in three circular areas respectively. The
species lie only on the edges of the circles, and the interiors of the circles are empty. The
circles formed by the samples of eukarya, archaea and mycoplasma, respectively, are obvious.
Even for eubacteria, we can observe a distribution with an empty center enclosed by a round
boundary. The two viruses are also near to each other. We can conclude that almost no species
fall off these observed circles.

The standard deviation of protein length $\Delta l$ and the peak number $p$ of the protein
length distribution are pivotal properties in studying genome size evolution. $\Delta l$
relates to the variation of protein length by its definition. And we can consider
$p \sim \frac{\eta}{a} - \frac{\theta}{b}$ as the "net driving force" in genome size evolution.

Global patterns of genome size variation at the levels of domains and phyla. In order to
observe the phylogenetic circles in more detail, we obtained more protein length distributions
based on the biological data of 775 microbes (725 eubacteria, 50 archaea) in NCBI. The microbial
taxonomy in this work is based on the NCBI taxonomy database [22, 23]. Thus, we can obtain a
detailed distribution of microbes in the $\Delta l - p$ plane. At the level of domains, we can
also observe two phylogenetic circles, for eubacteria and archaea respectively (Fig. 4a).

The large number of proteobacteria (blue legends in Fig. 4a) in the database makes it hard to
discern the phylogenetic circle of eubacteria. So we divide the 725 eubacteria into two groups:
"the group of 397 proteobacteria" and "the group of the other 328 eubacteria". Thus, we can
discern the phylogenetic circle of eubacteria. For the group of "the other 328 eubacteria", we
can observe a circular chain composed of 10 phyla (Firmicutes, Acidobacteria, Actinobacteria,
Cyanobacteria, Bacteroidetes/Chlorobi, Spirochaetes, Chlamydiae/Verrucomicrobia, Chloroflexi,
Deinococcus-Thermus, Thermotogae) (Fig. 4b). This circular chain shows that the global picture
of the distribution of eubacteria is indeed a phylogenetic circle, although the numbers of
species vary greatly among these 10 phyla in the database. For the group of proteobacteria, the
species from the five classes (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria,
Deltaproteobacteria and Epsilonproteobacteria) also form an arc of the phylogenetic circle at
the same place as the circular chain of the other 10 phyla (Fig. 4c). In particular, the species
in the class Alphaproteobacteria almost form a closed circular distribution.

According to the detailed observation of the phylogenetic circles, we conjecture that the
species in the same domain form a closed circle in the $\Delta l - s$ plane, while the species in
a phylum or lower taxon form only an arc of the circle. Namely, the global pattern of genome
size variation at the level of domains can be described by phylogenetic circles, while the
pattern of genome size variation at the levels of phyla and lower taxa reflects only local
properties of the corresponding phylogenetic circle.

Bifurcation of genome size distribution in the $\Delta l - s$ plane. The distribution of species
in the $\Delta l - s$ plane is very interesting: its shape resembles the two wings of a butterfly
(Fig. 3b). Considering the close relationship between $p$ and $s$, this distribution is similar
to the one in the $\Delta l - p$ plane. We can clearly observe two asymptotes that divide the
plane into four quadrants. The origin $O$, i.e. the point of intersection of the asymptotes,
corresponds to a special genome size $s^*$ (Fig. 3b). There are almost no species in the upper
and lower quadrants. All bacteria gather in either the left quadrant or the right quadrant;
archaea also gather in the left or right quadrants, but only in the lower parts; all eukaryotes
gather in the right quadrant, but in the upper part and far away from $s^*$; and the two viruses
gather in the left quadrant, but in the upper part and far away from $s^*$. The distribution of
species in the $\Delta l - N$ plane is also similar to the one in the $\Delta l - p$ plane, but
the circular shapes are less distinct (Fig. 3h).

We also obtained a more detailed distribution in the $\Delta l - s$ plane based on the biological
data of the 775 microbes in NCBI. The overall shape of the distribution also resembles a
butterfly (Fig. 4d). In particular, the distribution of proteobacteria in the $\Delta l - s$ plane
agrees with a butterfly shape very well (Fig. 4e). We observed that the distributions of species
in the groups Alphaproteobacteria, Betaproteobacteria and Gammaproteobacteria also agree with the
butterfly shape, while the species in the groups Deltaproteobacteria and Epsilonproteobacteria
are distributed on the right wing and the left wing respectively.

Though the distributions of archaea and eukaryotes obviously deviate from the distribution of
eubacteria, the overall distribution of all species does not violate the butterfly shape. The
positions of Archaea, Eubacteria and Eukarya indicate that, at the level of domains, the greater
the standard deviation of the protein length in a proteome is, the greater the genome size is.

The butterfly-shaped distribution of species in the $\Delta l - s$ plane helps us to understand
the variation of genome sizes, and strongly indicates the bidirectionality of genome size
evolution. The genome size corresponding to the connection point of the two wings in Fig. 4d is
approximately the same as the genome size corresponding to the center of the phylogenetic circle
in Fig. 4b. In the evolution of genome size for closely related species, the genome size may
either increase or decrease. The increasing trend appears to be considerably stronger than the
decreasing trend in genome size evolution, because there are clearly more species on the right
wing than on the left wing in the distribution of species in the $\Delta l - s$ plane (Fig. 4d
and 4e).

5 Unravelling the mechanism of genome size evolution 

Global and local pictures of genome size evolution. A distinct phylogenetic circle can be
taken as a natural definition of a domain. The mechanism for the origin of domains is quite
different from the mechanism for the origin of phyla. There are two significant events in the
evolution of life: the origin of domains in the early stage of evolution and the origin of
animal phyla around the Cambrian period [20, 24]. The phylogenetic circles exist only at the
level of domains according to the distribution of species in the $\Delta l - p$ plane. The
properties at the level of domains do not couple with the later evolution at the levels of phyla
and so forth. So phylogenetic circles are stable characteristics and may have existed from the
early stage of the evolution of life to the present day. The observation of phylogenetic circles
can be explained by Woese's theory of cellular evolution [25]. According to his perspective,
horizontal gene transfer is the principal driving force in early cellular evolution. Primitive
cellular evolution is basically communal, and it is not individual species that evolve at all.
So the evolution of a genome cluster is more essential than the evolution of an individual
genome, and the study of the origin of a living system is much more valuable than the study of
the origin of just one species.

The phylogenetic circles are "global" properties, while the traditional conception of the
phylogenetic tree is "local". Although the phylogenetic tree is useful in many circumstances, it
has unfortunately hindered our comprehension of the panorama of the taxonomy of life. In the
global scenario, we emphasize that the phylogenetic circle should evolve as a whole, while in
the local scenario, a species can evolve freely along a continually branching phylogenetic tree.
The underlying mechanism of evolution must be a "global" theory. An individual life form could
never originate and evolve unless it existed within a phylogenetic circle of primordial life. A
new theory of the evolution of life is necessary to understand the evolution of a cluster of
genomes as a whole. Then we may explain the trajectory of the phylogenetic circles in the
$\Delta l - p$ plane during evolution. The local properties of a phylogenetic circle are the
same as the properties of a phylogenetic tree, but we cannot be aware of the global constraint in
the local scenario.

Dynamics of genome size evolution. Because a circle and a tree have distinctly different
topological properties, a mechanism of genome size evolution on a circle will differ considerably
from the traditional mechanisms of genome size evolution based on phylogenetic trees. The genome
size cannot increase without limit when evolving along a circular pathway, but it can go to
infinity along an endlessly branching pathway. So there is an intrinsic mechanism to reduce the
genome size, which closely relates to protein evolution. According to Eqn. (19) and Eqn. (20),
we can calculate the genome size and the non-coding DNA size from $p$ and $\theta$, both of
which are based on the data of protein length distributions. These empirical formulae indicate
that protein evolution principally drives genome size evolution.

The driving force in genome size evolution is bidirectional. Large-scale gene duplications
and the accumulation of transposable elements are the primary contributors to genome expansion,
whereas the reverse mechanism that reduces genome size is less understood. According to our
scenario of phylogenetic circles, the global circular relationship in a domain must constrain
genome expansion. More explicitly, the genome size of a species has to evolve within the
community of the domain; it cannot evolve independently. In our previous work on the trend of
genome size evolution, two dynamic factors (promoter η and hinderer θ) determine the genome
size evolution, where η corresponds to the process of polyploidy and the accumulation of
transposable elements and θ indicates the relationships among species. So the biological
meaning of the formula for genome size evolution agrees with the explanation based on the
scenario of phylogenetic circles.

Explanation of genome size distribution. According to the bifurcation of the genome size
distribution in the Δl - s plane, it is easy to explain the patterns of genome size
distributions among taxa. There are two main types of genome size distributions: the
single-peak type and the double-peak type. For the single-peak type, the species in a taxon lie
on only one wing of the butterfly-shaped distribution in the Δl - s plane, so the outline of
the genome size distribution of this taxon has only one main peak. For the double-peak type,
the species in a taxon lie on both wings of the butterfly-shaped distribution in the Δl - s
plane, so the outline of the genome size distribution of this taxon has two main peaks. For
example, the genome size distributions for Eukaryotes or for Alphaproteobacteria belong to the
double-peak type, and the genome size distributions for Archaea or for Epsilonproteobacteria
belong to the single-peak type (Fig. 5). The genome size distributions for eukaryotic taxa
belong to the single-peak type [18] [26], because eukaryotes all lie on the right wing of the
butterfly-shaped distribution in the Δl - s plane.

On plant genome size evolution. Recent studies have made significant advances in
understanding the mechanism of plant genome size evolution, where polyploidy and the
accumulation of transposable elements play significant roles in plant genome expansion,
although less is known about the process of DNA removal [26] [27] [28] [29] [30]. It is
reported that "different land plant groups are characterized by different C-value profiles,
distribution of C-values and ancestral C-values" [26]. From the viewpoint of phylogenetic
circles, different land plant groups should be situated on the eukaryotic phylogenetic circle
and form a circular chain that is similar to the bacterial circular chain in Fig. 4b. So it is
natural to draw the above conclusion, because (i) ancestral C-values should spread around the
circular chain and (ii) the "local" properties of genome size evolution at different places on
the phylogenetic circle should also differ among different plant groups.

The formulae on the trend of genome size evolution can explain the interactions between
coding DNA and non-coding DNA in genome evolution. Entire genome duplication contributes the
majority of the genome size increase in plants. Simple chromosome duplication may double the
genome size on the left-hand side of Eqn. (17) or (19). However, the values of η and θ, or p,
on the right-hand side of Eqn. (17) or (19) remain invariant, because the protein length
distribution does not change in simple chromosome duplication. This apparent contradiction with
the trend of genome size evolution will drive the alteration of coding DNA so that η and p tend
to increase after the chromosome duplication. Such evolutionary pressure agrees with the
experimental observations. After duplication, the two copies of a gene are redundant. Because
one of the copies is freed from functional constraint, mutations in this copy will be
selectively neutral and will most often turn it into a nonfunctional pseudogene [18]. Hence,
the ratio of non-coding DNA η will increase. On the other hand, gene duplication can provide
source material for the origin of new genes with alternative lengths [28]. Consequently, the
protein length distribution will become more complex and the peak number p will be driven to
increase. So p intrinsically measures protein evolution. Due to the rapid adjustment shortly
after the chromosome duplication, the genome size can return to the trend of genome size
evolution described in Eqns. (17) and (19).
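The invariance argument above can be checked on a toy case. The following minimal sketch assumes that the protein length distribution is normalized by the number of proteins and that a duplication copies every gene exactly once; the function names and the toy proteome are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def length_distribution(protein_lengths):
    """Normalized protein length distribution: {length: fraction of proteins}."""
    counts = Counter(protein_lengths)
    total = len(protein_lengths)
    return {length: n / total for length, n in counts.items()}


def mean_length(protein_lengths):
    """Average protein length of a proteome."""
    return sum(protein_lengths) / len(protein_lengths)


# Hypothetical toy proteome (lengths in a.a.), made up for illustration only.
proteome = [120, 250, 250, 310, 480, 900]

# Idealized whole-genome duplication: every gene is copied exactly once.
duplicated = proteome * 2

# The gene count doubles, but the normalized distribution and its mean do not move.
print(len(duplicated) / len(proteome))
print(length_distribution(duplicated) == length_distribution(proteome))
print(mean_length(duplicated) == mean_length(proteome))
```

Any quantity computed only from the shape of the distribution is therefore unchanged immediately after such a duplication; it can change only once the copies start to diverge.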

It is also reported that more ancient land plants tended to have smaller genome sizes [26].
Our theory of genome size evolution agrees with this observation. According to Eqn. (21), the
overall trend of genome size increased exponentially with respect to time, so more ancient life
tended to have smaller genome sizes.

A roadmap to transform the bifurcated distribution in the Δl - s plane to phylogenetic
circles in the Δl - p plane. The distribution of species in the Δl - p plane is approximately
circular, while the distribution of species in the Δl - s plane is approximately random within
the butterfly-shaped area. However, there is an intrinsic relationship between the
distributions of species in the Δl - p plane and in the Δl - s plane. According to Eqn. (19),
we cannot directly explain the deformation from the butterfly-shaped distribution in Fig. 3b
to the circular distribution in Fig. 3a. In the following, we show that there are interesting
relationships among different schemes of clustering of species based on different properties
such as η, θ, s, Δl and l̄. The clustering analysis can help us understand the classification
of life.

On the one hand, according to the approximately proportional relationship between Δl and l̄
(Fig. 3c), it is easy to understand the similarity between the distribution of species in the
s - l̄ plane (Fig. 3d) and the distribution in the s - Δl plane (Fig. 3b). The relationship
between Δl and l̄ can be explained by the fact that all the profiles of the protein length
distributions are similar, which relates to a stochastic process [31] [32]. Next, we can
explain the mirror symmetry with respect to a horizontal line between the distribution in the
s - l̄ plane (Fig. 3d) and the distribution in the s - η plane (Fig. 3f) according to the
coarse linear relationship between l̄ and η for prokaryotes (Fig. 3e). Furthermore, we know
that the distribution of species in the p - η plane is similar to the distribution in the
s - η plane according to Eqn. (19).

In a previous work [20], we have explained the transformation from the symmetric
distribution of species in the η - θ plane (Fig. 3g) to the asymmetric distribution of species
in the η - s plane (Fig. 3f) according to Eqn. (17). The parameters η and θ play the roles of
promoter and hinderer in genome size evolution. We assume that only parameters in a certain
area of the η - θ plane are selected by the mechanism of genome evolution, which results in the
bifurcated distribution in the Δl - s plane.

On the other hand, we observed circular structures in the distribution of species in the
η - Δl plane (Fig. 3i). So the distribution of species in the p - Δl plane becomes circular
rather than random when we transform the distribution of species in the p - η plane to the
distribution of species in the p - Δl plane (Fig. 3a). Thus, we found a chain of
transformations from the distribution of species in the s - Δl plane to the distribution of
species in the p - Δl plane, which can transform the butterfly-shaped distribution in Fig. 3b
to the circular distribution in Fig. 3a.

6 Classification of life by correlation and quasi-periodicity of 
protein length distributions. 

Cluster analysis of protein length distributions. We propose a new method to classify life on
this planet, which is based on cluster analysis of protein length distributions. Unsurprisingly,
our results agree with the proposal of the three-domain classification, because the information
in the fluctuations of protein length distributions also comes from the information in the
molecular sequences. Interestingly, we show again that the fluctuations of protein length
distributions cannot be taken as random fluctuations; they are essential in clustering species.
Some standard cluster analysis methods from the theory of multivariate data analysis are
applied to classify the protein length distributions of the species in PEP. We introduced the
average correlation coefficient R(α), the average Minkowski distance D(α), the average protein
length l̄(α), the peak number p(α), etc. for each species α in PEP (see Definitions and
notations). All of the above quantities can be calculated from the data of protein length
distributions alone. The three domains (Bacteria, Archaea and Eucarya) can be separated
successfully according to the distributions of species in the plots of the relationships among
these quantities.

Firstly, we studied the distribution of species in the l̄ - p plane, where l̄ and p depend
only on the species's own data. We found that the groups of species in Bacteria, Archaea and
Eucarya cluster together in three regions respectively (Fig. 6a). The archaea cluster in a
small region where l̄ and p are relatively small; the bacteria cluster in a region where l̄ and
p are intermediate; and the eukaryotes cluster in a region where l̄ and p are relatively large.
Thus, we have a new method to classify life. If the protein length distribution of a species is
known but its classification is unclear, we can calculate the average protein length l̄ and the
peak number p of this species. Then we can determine which domain the species belongs to
according to its position in the l̄ - p plane. Generally speaking, there is a correlation
across the three domains: large p corresponds to large l̄. Such a correlation, however, does
not hold for the species within the same domain. The relationship between p and the genome
size s is much closer than the relationship between p and the average protein length l̄
(compare Fig. 2a and Fig. 6a). If we consider only one quantity, either l̄ or p, we cannot
separate archaea from bacteria.

Secondly, we studied the distribution of species in the R - log10 D plane, where, by
definition, R and D depend on the data of the other species. Here the groups of species in the
three domains also cluster together respectively (Fig. 6b). The cluster of eukaryotes is
clearly separated. The small region of the archaea borders on the large region of the
eubacteria, so archaea and eubacteria can still be separated. In the above, we chose the
parameter q = 1/4 in the definition of the Minkowski distance d(α, β) and calculated the
average Minkowski distance D accordingly. With this choice of the parameter q, the three
domains can be separated more easily, even by the average Minkowski distance D alone. The
results are similar if q is varied from 1/2 to 1/8 in calculating D.
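The two averaged quantities used in this step can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact definitions (those are given in "Definitions and notations", outside this section): we assume r(α, β) is the Pearson correlation between two protein length distributions sampled on the same length grid, d(α, β) is the Minkowski-type distance (Σ |x_i(α) - x_i(β)|^q)^(1/q), and the averages run over all other species. The function names and the toy distributions are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def minkowski(x, y, q=0.25):
    """Minkowski-type distance with exponent q (q = 1/4 as chosen in the text)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def average_R_and_D(distributions, q=0.25):
    """Map each species to (R, D), averaged over all *other* species (an assumption)."""
    result = {}
    for name, x in distributions.items():
        others = [y for other, y in distributions.items() if other != name]
        R = sum(pearson(x, y) for y in others) / len(others)
        D = sum(minkowski(x, y, q) for y in others) / len(others)
        result[name] = (R, D)
    return result


# Hypothetical toy "protein length distributions" on a common grid (not real data).
toy = {"A": [1, 3, 2, 5], "B": [1, 3, 2, 5], "C": [5, 2, 3, 1]}
for name, (R, D) in average_R_and_D(toy).items():
    print(name, round(R, 4), round(math.log10(D), 4))
```

Species would then be plotted at (R, log10 D); in the toy data the outlier "C" ends up with a lower R and a larger D than the two identical distributions.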

Lastly, we studied the distribution of species in other plots. According to the distributions
of species in the plots of l̄ - R, l̄ - log D and D - p, we found that the groups of species in
the three domains still cluster together in the corresponding plots respectively (Fig. 6c).

Cross-validated ROC analysis. The cross-validated receiver operating characteristic (ROC)
analysis can be taken as an objective measure of the quality of the above cluster analysis
[33]. For instance, we can check the validity of the cluster analysis between Bacteria and
Archaea by R and log10 D as follows. We found that the cross-validated ROC curves deviate
clearly from the diagonal line, which shows the validity of our method of clustering species
according to the properties of their protein length distributions (Fig. 6d).

The method to draw the cross-validated ROC curve is as follows in detail. Firstly, we
randomly separated the species in PEP into two groups G1 and G2. There are 42 bacteria, 7
archaea, 3 eukaryotes and 1 virus in group G1, and the remaining 53 species are in group G2
(Fig. 6d). Based only on the biological data of the species in G1, we can define the
corresponding average correlation coefficient R* and average Minkowski distance D*. The
boundary between Bacteria and Archaea can then be marked according to the distribution of the
points (R*, log10 D*) for the species in G1 in the R* - log10 D* plane. Then we can calculate
the correlation coefficient r(α, β) and the Minkowski distance d(α, β) between each species
α ∈ G2 and the species β ∈ G1, and accordingly obtain their average values over β ∈ G1 for each
species α ∈ G2, denoted R**(α) and D**(α).

Still in the R* - log10 D* plane, we obtained a group of dots (R**, log10 D**) for the
species in G2. Some of the archaea in G2 still belong to the region of archaea according to the
boundary defined by the data of the species in G1, while other archaea in G2 cross the
boundary. We can obtain the cross-validated ROC curve from the validity of the cluster analysis
for the species in G2 by shifting the position of the boundary (Fig. 6e). We can repeat the
above procedure after exchanging the data between G1 and G2, which gives another
cross-validated ROC curve.
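The boundary-shifting step can be sketched in simplified form. In the sketch below, each held-out species in G2 is assumed to have been reduced to a single score (for example, a signed distance from the boundary fitted on G1) together with a true label; the one-dimensional score, the function names and the toy numbers are assumptions made for illustration and do not reproduce the paper's two-dimensional boundary.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over the scores.

    A species is predicted positive (e.g. archaeon) when score >= threshold.
    labels[i] is True for the positive class. Both classes must be present.
    """
    pos = sum(1 for is_pos in labels if is_pos)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, is_pos in zip(scores, labels) if s >= threshold and is_pos)
        fp = sum(1 for s, is_pos in zip(scores, labels) if s >= threshold and not is_pos)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for four held-out species (not real data).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [True, True, False, False]
curve = roc_points(scores, labels)
print(curve)
print(area_under_curve(curve))
```

A curve that hugs the diagonal (area near 0.5) would indicate no discriminating power, which is why a clear deviation from the diagonal is read as evidence that the clustering is valid. Swapping the roles of G1 and G2 and repeating the computation gives the second curve.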



7 Spectral analysis of protein length distributions

Characteristics of power spectrum. The evolution of protein length is a largely unexplored
field in the study of molecular evolution. Although the mechanism of the evolution of protein
length is unknown, we observed order in the protein lengths, such as quasi-periodicity, long
range correlation and the tendency for conservation of protein length in domains. In this
paper, we try to study the properties of protein length distributions by spectral analysis. In
the section "Definitions and notations", we defined a power spectrum y(α) for any species α. We
defined the characteristic frequency fc and the maximum frequency fm, and we also defined the
characteristic period Lc and the minimum period Lm of the protein length distribution. For the
domains Bacteria, Archaea and Eucarya, we defined the corresponding average power spectra.
Considering additional quantities such as the average protein length l̄, the peak number p and
the non-coding DNA content η, we observed some interesting correlations among these quantities.
We show that there are correlations between protein lengths at different scales.
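As a rough illustration of these quantities, the sketch below computes a power spectrum with a plain discrete Fourier transform and reads off the frequency of the highest peak, as marked for E. coli in Fig. 1b. The exact definition of y(α) is given in "Definitions and notations" and is not reproduced here, so the mean subtraction, the squared-modulus normalization and the function names are assumptions; the input signal is synthetic.

```python
import cmath
import math


def power_spectrum(x):
    """Squared modulus of the DFT of x for frequencies f = 1 .. len(x) // 2.

    The mean is subtracted first so that the zero-frequency term does not
    dominate (an assumption of this sketch).
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spectrum = []
    for f in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * f * k / n)
                    for k, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def characteristic_frequency(spectrum):
    """Frequency index (starting at 1) of the highest peak of the spectrum."""
    best = max(range(len(spectrum)), key=lambda i: spectrum[i])
    return best + 1


# Synthetic quasi-periodic "distribution": a strong component at f = 5 and a
# weaker one at f = 12 (illustrative only, not real protein length data).
n = 64
signal = [math.cos(2 * math.pi * 5 * k / n) + 0.3 * math.cos(2 * math.pi * 12 * k / n)
          for k in range(n)]
spectrum = power_spectrum(signal)
fc = characteristic_frequency(spectrum)
print(fc, n / fc)  # characteristic frequency and the period it would correspond to
```

The naive transform costs O(n^2) operations, which is acceptable for a sketch; a fast Fourier transform routine would normally be used for real proteome-scale distributions.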

The protein length hierarchy. Structures can be observed in the fluctuations of the protein
length distributions. We found that there are correlations between the characteristic frequency
fc and the maximum frequency fm (Fig. 7a, 7c). The characteristic frequency fc increases with
the maximum frequency fm, which is especially clear for archaea and eukaryotes. There is also a
correlation between the characteristic period Lc and the minimum period Lm (Fig. 7b, 7d). The
values of Lc and Lm are intrinsic properties of protein length distributions that are
independent of the choice of the cutoff m. Hence we found that the characteristic period Lc
increases with the minimum period Lm, especially for archaea and eukaryotes. Such an intrinsic
correlation between Lc and Lm shows that there is a hierarchy in protein lengths. There might
be a general mechanism in the organization of protein segments, such that the long protein
length period Lc varies with the short protein length period Lm for individual species.

The constraint on average protein length. Compared with the fact that genome sizes range
over more than 1,000,000-fold among the species on the planet, the average protein lengths in
proteomes (several hundred a.a.) vary only slightly. There is a tendency for conservation of
protein length in Bacteria, Archaea and Eucarya respectively. The average protein lengths in
proteomes for Bacteria range from about 250 a.a. to about 350 a.a.; the values for Archaea are
a little smaller; the values for Eucarya are around 500 a.a. The protein lengths vary slightly
while the genome size evolves rapidly. Such a sharp contrast awaits an explanation. One
possible solution is based on the understanding of the evolutionary outlines of genome size and
gene number from the beginning of life, t ≈ -3,800 Myr, to the present, t = 0. According to the
theory in Ref. [20], we can obtain a formula for the evolutionary outline of the average
protein length



where the subscripts denote two stages in the evolution. This formula can explain the
difference between genome size evolution and protein length evolution. Genome size increased
rapidly, while the average protein length varied slightly and even tended to decrease in each
stage of the evolution. Our results agree with experimental observations in principle. The
genome size was approximately proportional to the gene number before the time Tc, so the
average protein lengths for prokaryotes should have stayed approximately constant for most of
the time before Tc. Then the evolutionary speeds of both the gene number and the genome size of
eukaryotes shifted to new values after Tc, while the coding DNA content 1 - η began to
decrease. Such a transition in the evolution of genome size and gene number around Tc can set
an upper limit for the average protein lengths of eukaryotes in the subsequent evolution. The
constraint on protein lengths could also be explained in an alternative way. The spectral
analysis of protein length distributions might help us to understand the intrinsic mechanism of
protein length evolution in detail. According to the relationship between fc and η and the
relationship between fc and l̄, we can relate the evolution of the average protein length l̄ to
the non-coding DNA content η, i.e., l̄ tended to decrease when η increased gradually. So the
correlation of protein lengths can intrinsically constrain the average protein lengths within
a certain range during evolution.

The distribution of species in the fc - η plane shows a regular pattern: the value of fc
tends to go from middle frequency to either lower frequency or higher frequency when η
increases gradually from about 0.1 to 1 (Fig. 8b). The same tendency of fc can be observed in
the fc - p plane (Fig. 8c) when p increases gradually. This tendency of fc is especially clear
in the distributions of eukaryotes. The mechanism constraining the average protein length can
be inferred from the rainbow-like distribution of species in the fc - l̄ plane, where the
species in Bacteria, Archaea and Eucarya gather in three horizontal convex arches respectively
(Fig. 8a). Such an order shows that the average protein length l̄ tends to evolve from long
(corresponding to middle fc) to short (corresponding to lower or higher fc). We can observe
directly that l̄ decreases when η increases (Fig. 3e); the intrinsic mechanism behind this,
however, should be revealed by spectral analysis.

Average power spectra and phylogeny of three domains. We can study the properties of the
average power spectrum for Bacteria, Archaea and Eucarya respectively, which reflect the
phylogeny of the three domains. An important characteristic is that the bottoms of the profiles
of the average power spectra are either "convex" or "concave". From the results of several
different ways of smoothing the average power spectra of the three domains, we always concluded
that the profiles of the average power spectra of Archaea and Eucarya have "convex bottoms",
while the profile of the average power spectrum of Bacteria has a "concave bottom", where the
"bottom" refers to the profile of the power spectrum at f around m/2 (Fig. 9). It is well known
that the relationship between Archaea and Eucarya is closer than the relationship between
Archaea and Bacteria. So the property of the outlines of the average spectra agrees with the
phylogeny of the three domains. A convex bottom indicates that the power spectrum in the high
frequency sector (at f ~ 1500) prevails over the power spectrum in the low frequency sector (at
f ~ 500), while a concave bottom indicates the opposite case. So the differences in the
"bottoms" of the power spectra of the three domains might result from the underlying mechanism
of protein length evolution.

In the above, the outlines of the average power spectra are obtained by smoothing the
average power spectra with two methods. In the first method, we smooth the average power
spectrum of each domain by averaging over neighboring frequencies, where 2w + 1 is the width of
the averaging sector and the range of f is f = 1 + w, ..., m - w. We obtain two sets of
outlines of the average power spectra of the three domains (w = w1 = 100 or w = w2 = 300) in
the averaging calculations (Fig. 9a-9c). In the second method, we use the Savitzky-Golay method
[34] to obtain the outline of the power spectrum for each species α. Then we average these
outlines over Bacteria, Archaea and Eucarya respectively (Fig. 9d-9f). We found that all the
outlines for Bacteria (averaging with w1, averaging with w2, and Savitzky-Golay) have concave
bottoms, and the corresponding outlines for Archaea and Eucarya have convex bottoms.
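The first smoothing method can be sketched as a centered moving average. The sketch assumes that the outline at frequency f is the plain mean of the 2w + 1 spectrum values from f - w to f + w, which is consistent with the stated width and with the range f = 1 + w, ..., m - w, but the exact formula is not reproduced in this section; the function name and the toy input are hypothetical.

```python
def moving_average_outline(y, w):
    """Centered moving average of a power spectrum.

    y[0] is taken to be the value at f = 1, so the returned list holds the
    outline for f = 1 + w, ..., m - w, where m = len(y).
    """
    m = len(y)
    width = 2 * w + 1
    if m < width:
        raise ValueError("spectrum is shorter than the averaging window")
    window_sum = sum(y[:width])
    outline = [window_sum / width]
    for i in range(width, m):
        # Slide the window one step: add the new value, drop the oldest one.
        window_sum += y[i] - y[i - width]
        outline.append(window_sum / width)
    return outline


# Tiny illustrative input; the paper uses w = 100 and w = 300 on full spectra.
print(moving_average_outline([1, 2, 3, 4, 5], w=1))
```

For the second method, SciPy's `scipy.signal.savgol_filter` with a window length of 501 and polynomial order 2 would match the span and degree quoted in the caption of Fig. 9, assuming that library is what one chooses to use.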

8 Conclusion and discussion 

We conclude that the classification of life can be studied through an understanding of the
fundamental mechanism of genome size evolution. The phylogenetic relationship among species in
a domain is circular, in contrast to the traditional concept of branching trees. The
phylogenetic circle is a global property of living systems at the level of the domain. We
propose a natural criterion to define a domain by each of the phylogenetic circles. We observed
at least three main phylogenetic circles, corresponding to the three known domains. In the
global scenario of phylogenetic circles, we can explain the driving force in genome size
evolution and the patterns of genome size distributions. The peak number p plays the role of
the net driving force in genome size evolution. The genome size depends on two factors: (i) the
net driving force p, and (ii) the circular phylogenetic relationship in a domain. Thus, there
is no trivial correlation between genome size and biological complexity. The global circular
relationship is quite different from the local branching relationship. Any underlying mechanism
for the origin and evolution of life should take into account that a domain evolves as a whole.

There is rich evolutionary information stored in the fluctuations of protein length
distributions. In the past, the fluctuations in protein length distributions were routinely
assumed to be random fluctuations on a smooth background. Such a prejudice may result in the
neglect of the pivotal evolutionary information stored in these fluctuations. Based on the
biological data of protein lengths in a proteome, we can calculate the genome size as well as
the ratios of coding DNA and non-coding DNA for a species. Our results agree with the
biological data very well. So there is a profound relationship between the evolution of
non-coding DNA and the evolution of coding DNA. We reconfirm the three-domain classification of
life by cluster analysis of protein length distributions. We found that there are correlations
between long periods and short periods of protein length distributions. The validity of our
results can be verified by objective measures, so the results should not be ascribed to
accidental coincidences. The study of protein length distributions gives us a chance to
understand the macroevolution of life.

There should be a universal mechanism which underlies molecular evolution. The fluctuations
in protein length distributions may result from this universal mechanism. Thus we can
determine the position of a species in the evolution of life by correlation analysis of the
protein length distributions, and thereby obtain a panorama of the evolution of life. There are
many analogies between protein language and human natural language. We conjecture that
linguistics may play a central role in protein length evolution. A linguistic model was made to
study protein length evolution. In this model, protein sequences can be generated by grammars,
hence we can obtain simulated protein length distributions for a set of grammars. The average
protein lengths l̄ and the peak numbers p can then be calculated. The correlation between the
peak number p and the average protein length l̄ in experimental observation can be explained by
the simulation. Our results indicate an intrinsic relationship between the complexity of
grammars in protein sequences and the peak numbers in protein length distributions.
Figure 1: Protein length distribution and power spectrum of E. coli. a, Protein length
distribution x(E. coli). b, The power spectrum y(E. coli); the characteristic frequency fc at
the highest peak is marked.

Figure 2: Prediction of genome size and non-coding DNA content. a, p is proportional to
log10 s (correlation coefficient 0.9428). b, The non-coding DNA content predicted by the
formula agrees with the biological data (correlation coefficient 0.9468). c, The relation
between p and η. d, The relation between p and θ. e, The relation between p and log10 ^ is not
linear (correlation coefficient 0.8747).

Figure 3: The mechanism of genome evolution can be inferred from the circular distribution
of species in the Δl - p plane. a, The phylogenetic circles consist of species in Archaea,
Eukarya, Eubacteria and Mycoplasma. The fundamental relationship goes round in circles in each
domain. b, Species are distributed only in the left and right quadrants. The origin O is at
s* ~ 3.5 x 10^ bp. c, The approximately proportional relation between l̄ and Δl (correlation
coefficient 0.9022). d, The distribution of species in the s - l̄ plane. e, The coarse linear
relation between l̄ and η in each domain. f, The distribution of species in the s - η plane.
g, The distribution of species in the θ - η plane. h, The distribution of species in the
log10 N - Δl plane. i, Relationship between non-coding DNA ratio and protein length standard
error.

Figure 4: Phylogenetic circles in the Δl - p plane and bifurcated distribution in the Δl - s
plane. The abbreviations of the names of groups of species are as follows: Euryarchaeota (EA),
Crenarchaeota (CA), Alphaproteobacteria (Pα), Betaproteobacteria (Pβ), Gammaproteobacteria
(Pγ), Deltaproteobacteria (Pδ), Epsilonproteobacteria (Pε), Firmicutes (Fir), Acidobacteria
(Aci), Actinobacteria (Act), Cyanobacteria (Cya), Bacteroidetes/Chlorobi (Chi), Spirochaetes
(Spi), Chlamydiae/Verrucomicrobia (Ver), Chloroflexi (Cho), Deinococcus-Thermus (Dei),
Thermotogae (The). a, Phylogenetic circles formed by 775 microbes in NCBI. b, The phylogenetic
circle formed by phyla of eubacteria. c, The phylogenetic circle formed by proteobacteria.
d, Bifurcated distribution of species in the Δl - s plane. e, Bifurcated distribution of
proteobacteria.

Figure 5: Genome size distributions. The width of each genome size section is 3.5 x 10^ bp.

Figure 6: Classification of life based on cluster analysis of protein length distributions.
a, The distributions of species in the p - l̄ plane. b, The distributions of species in the
R - log10 D plane. c, The distributions of species in the l̄ - log10 D plane.
d, Cross-validation analysis of the classification between Bacteria and Archaea by the
distribution in the R - log10 D plane. The species in group G1 are red, and the others in G2
are blue. e, The cross-validated ROC curve shows the validity of the classifier.

Figure 7: The protein length hierarchy. We choose the top 80 (30) highest peaks to find the
largest frequency fm (f'm) of obvious peaks, and consequently obtain Lm (L'm). a, fm
approximately increases with fc for Archaea and Eukarya. b, Lm approximately increases with Lc
for Archaea and Eukarya. c, f'm varies with fc in waves for Archaea and Eukarya. d, L'm varies
with Lc in waves for Archaea and Eukarya.

Figure 8: Explanation of the constraint on the average protein lengths. a, The rainbow-like
distributions of species in three domains. The arc of Archaea is the lowest; the arc of
Eubacteria is in the middle; and the arc of Eukarya is on the top. Mycoplasmas also form an
arc. b, The relationship between η and fc. The V-shaped distribution of Eukaryotes is obvious.
c, The relationship between p and fc. The V-shaped distribution of Eukaryotes is also obvious,
and the distributions of Archaea and Eubacteria each form a flat arc.

Figure 9: The profiles of total power spectra for three domains. After smoothing, the bottoms
of the total power spectra are convex for Archaea and Eucarya but concave for Bacteria.
a-c, Smoothing by averaging in neighboring sections. Neighboring section w1 = 100 (blue) and
w2 = 300 (red). d-f, Smoothing by the Savitzky-Golay method (span = 501, degree = 2).
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Table 2: List of the species in PEP 



(No. 1) Acholeplasma florum (Mesoplasma florum) domain: Eubacteria 

(No. 2) Acinetobacter sp (strain ADPl) domain: Eubacteria 

(No. 3) Aeropyrum pemix Kl domain: Archaebacteria 

(No. 4) Agrobacterium tumefaciens (strain C58 / ATCC 33970) Eubacteria 

(No. 5) Agrobacterium tumefaciens domain: Eubacteria 

(No. 6) Aquifex aeolicus domain: Eubacteria 

(No. 7) Arabidopsis thaliana domain: Eukaryote 

(No. 8) Achaeoglobus fulgidus domain: Archaebacteria 

(No. 9) Bacillus anthracis (strain Ames) domain: Eubacteria 

(No. 10) Bacillus cereus (ATCC 14579) domain: Eubacteria 

(No. 1 1) Bacillus subtilis domain: Eubacteria 

(No. 12) Bacteroides thetaiotaomicron VPI-5482 domain: Eubacteria 

(No. 13) Bartonella henselae (Houston-1) domain: Eubacteria 

(No. 14) Bartonella quintana (Toulouse) DOMAIN: Eubacteria 

(No. 15) Bdellovibrio bacteriovorus domain: Eubacteria 

(No. 16) Bordetella bronchiseptica RB50 DOMAIN: Eubacteria 

(No. 17) Borrelia burgdorferi domain: Eubacteria 

(No. 18) Bordetella parapertussis domain: Eubacteria 

(No. 19) Bordetella pertussis domain: Eubacteria 

(No. 20) Bradyrhizobium japonicum domain: Eubacteria 

(No. 21) Brucella melitensis; B melitensis; brume domain: Eubacteria 

(No. 22) Buchnera aphidicola (subsp. Acyrthosiphon pisum) Eubacteria 

(No. 23) Buchnera aphidicola (subsp. Schizaphis graminum) Eubacteria 

(No. 24) Buchnera aphidicola (subsp. Baizongia pistaciae) Eubacteria 

(No. 25) Caenorhabditis elegans domain: Eukaryote 

(No. 26) Campylobacter jejuni domain: Eubacteria 

(No. 27) Candidatus Blochmannia floridanus domain: Eubacteria 

(No. 28) Caulobacter crescentus domain: Eubacteria 

(No. 29) Chlamydophila caviae domain: Eubacteria 

(No. 30) Chlamydia muridarum domain: Eubacteria 

(No. 31) Chlorobium tepidum domain: Eubacteria 

(No. 32) Chlamydia trachomatis domain: Eubacteria 

(No. 33) Chromobacterium violaceum ATCC 12472 domain: Eubacteria 

(No. 34) Clostridium acetobutylicum domain: Eubacteria 

(No. 35) Clostridium perfringens domain: Eubacteria 

(No. 36) Clostridium tetani domain: Eubacteria 

(No. 37) Corynebacterium diphtheriae NCTC 13129 domain: Eubacteria 

(No. 38) Corynebacterium efficiens domain: Eubacteria 

(No. 39) Corynebacterium glutamicum domain: Eubacteria 

(No. 40) Coxiella burnetii domain: Eubacteria 

(No. 41) Deinococcus radiodurans domain: Eubacteria 

(No. 42) Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough domain: Eubacteria 

(No. 43) Drosophila melanogaster domain: Eukaryote 

(No. 44) Escherichia coli domain: Eubacteria 

(No. 45) Enterococcus faecalis domain: Eubacteria 

(No. 46) Erwinia carotovora domain: Eubacteria 

(No. 47) Fusobacterium nucleatum domain: Eubacteria 

(No. 48) Gloeobacter violaceus domain: Eubacteria 

(No. 49) Haemophilus ducreyi domain: Eubacteria 

(No. 50) Haemophilus influenzae domain: Eubacteria 

(No. 51) Halobacterium sp. (strain NRC-1) domain: Archaebacteria 

(No. 52) Human cytomegalovirus (strain AD169) domain: virus 

(No. 53) Helicobacter heilmannii domain: Eubacteria 

(No. 54) Helicobacter pylori domain: Eubacteria 

(No. 55) Homo sapiens domain: Eukaryote 

(No. 56) Lactobacillus johnsonii domain: Eubacteria 

(No. 57) Lactococcus lactis (subsp. lactis) domain: Eubacteria 

(No. 58) Lactobacillus plantarum WCFS1 domain: Eubacteria 

(No. 59) Leifsonia xyli (subsp. xyli) domain: Eubacteria 

(No. 60) Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni) domain: Eubacteria 

(No. 61) Listeria innocua domain: Eubacteria 

(No. 62) Listeria monocytogenes domain: Eubacteria 

(No. 63) Methanosarcina acetivorans domain: Archaebacteria 

(No. 64) Methanopyrus kandleri domain: Archaebacteria 

(No. 65) Methanobacterium thermoautotrophicum domain: Archaebacteria 

(No. 66) Methanobacterium thermoautotrophicum domain: Archaebacteria 

(No. 67) Mus musculus domain: Eukaryote 

(No. 68) Murine herpesvirus 68 strain WUMS domain: virus 

(No. 69) Mycobacterium avium; M avium; mycav domain: Eubacteria 

(No. 70) Mycobacterium bovis AF2122/97 domain: Eubacteria 

(No. 71) Mycoplasma gallisepticum domain: Eubacteria 

(No. 72) Mycoplasma genitalium domain: Eubacteria 

(No. 73) Mycoplasma mycoides (subsp. mycoides SC) domain: Eubacteria 

(No. 74) Mycoplasma pneumoniae domain: Eubacteria 

(No. 75) Mycoplasma pulmonis domain: Eubacteria 

(No. 76) Neisseria meningitidis domain: Eubacteria 

(No. 77) Nitrosomonas europaea domain: Eubacteria 

(No. 78) Oceanobacillus iheyensis domain: Eubacteria 

(No. 79) Porphyromonas gingivalis domain: Eubacteria 

(No. 80) Pseudomonas aeruginosa domain: Eubacteria 

(No. 81) Pseudomonas putida domain: Eubacteria 

(No. 82) Pyrococcus abyssi domain: Archaebacteria 

(No. 83) Pyrococcus furiosus domain: Archaebacteria 

(No. 84) Pyrococcus horikoshii domain: Archaebacteria 

(No. 85) Ralstonia solanacearum domain: Eubacteria 

(No. 86) Rhizobium loti domain: Eubacteria 

(No. 87) Rickettsia conorii domain: Eubacteria 

(No. 88) Schizosaccharomyces pombe domain: Eukaryote 

(No. 89) Shigella flexneri domain: Eubacteria 

(No. 90) Staphylococcus aureus domain: Eubacteria 

(No. 91) Streptococcus agalactiae domain: Eubacteria 

(No. 92) Streptomyces coelicolor domain: Eubacteria 

(No. 93) Streptococcus pneumoniae domain: Eubacteria 

(No. 94) Streptococcus pyogenes domain: Eubacteria 

(No. 95) Sulfolobus solfataricus domain: Archaebacteria 

(No. 96) Thermoplasma acidophilum domain: Archaebacteria 

(No. 97) Thermotoga maritima domain: Eubacteria 

(No. 98) Treponema pallidum domain: Eubacteria 

(No. 99) Ureaplasma urealyticum domain: Eubacteria 

(No. 100) Vibrio cholerae domain: Eubacteria 

(No. 101) Vibrio parahaemolyticus RIMD 2210633 domain: Eubacteria 

(No. 102) Wolinella succinogenes domain: Eubacteria 

(No. 103) Xanthomonas axonopodis (pv. citri); X axonopodis (pv. citri) domain: Eubacteria 

(No. 104) Xylella fastidiosa domain: Eubacteria 

(No. 105) Saccharomyces cerevisiae domain: Eukaryote 

(No. 106) Yersinia pestis domain: Eubacteria 